157 research outputs found

    Time Aware Mining of Itemsets

    International audience. Frequent behavioural pattern mining is a very important topic of knowledge discovery, intended to extract correlations between items recorded in large databases or Web access logs. However, those databases are usually considered as a whole and hence, itemsets are extracted over the entire set of records. Our claim is that possible periods, hidden within the structure of the data and containing compact itemsets, may exist. These periods, as well as the itemsets they contain, might not be found by traditional data mining methods due to their very weak support. Furthermore, these periods might be lost depending on an arbitrary division of the data. The goal of our work is to find itemsets that are frequent over a specific period but would not be extracted by traditional methods since their support is very low over the whole dataset. In this paper, we introduce the definition of solid itemsets, which represent a coherent and compact behavior over a specific period, and we propose SIM, an algorithm for their extraction. This work may find many applications in sensitive domains such as fraud or intrusion detection.
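The gap between support over a period and support over the whole dataset can be illustrated with a small sketch. This is not the SIM algorithm from the abstract; the `support` helper, the toy `log`, and the item names are all hypothetical, and the period is assumed to be known in advance rather than discovered.

```python
from datetime import date

def support(transactions, itemset, start=None, end=None):
    """Fraction of transactions dated within [start, end] that contain `itemset`.

    `transactions` is a list of (day, set_of_items) pairs (assumed layout).
    A bound left as None is not applied.
    """
    items = set(itemset)
    in_period = [t for d, t in transactions
                 if (start is None or d >= start) and (end is None or d <= end)]
    if not in_period:
        return 0.0
    # items <= t is the subset test: every item of the itemset is in t.
    return sum(items <= t for t in in_period) / len(in_period)

# Hypothetical usage log: the pair {"gift", "wrap"} only shows up in late December.
log = [
    (date(2024, 3, 1), {"news"}),
    (date(2024, 4, 10), {"news", "sport"}),
    (date(2024, 5, 2), {"sport"}),
    (date(2024, 6, 15), {"news"}),
    (date(2024, 7, 20), {"sport", "weather"}),
    (date(2024, 8, 3), {"weather"}),
    (date(2024, 12, 20), {"gift", "wrap"}),
    (date(2024, 12, 21), {"gift", "wrap", "card"}),
    (date(2024, 12, 22), {"gift", "wrap"}),
    (date(2024, 12, 23), {"news"}),
]

global_support = support(log, {"gift", "wrap"})
december_support = support(log, {"gift", "wrap"},
                           start=date(2024, 12, 1), end=date(2024, 12, 31))
print(global_support, december_support)  # 0.3 0.75
```

With a minimum support of, say, 0.5, the itemset would be discarded over the whole log yet retained over December, which is the situation the abstract describes.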

    Atypicity Detection in Data Streams: a Self-Adjusting Approach

    International audience. Outlyingness is a subjective concept relying on the isolation level of a (set of) record(s). Clustering-based outlier detection is a field that aims to cluster data and to detect outliers depending on their characteristics (i.e. small, tight and/or dense clusters might be considered as outliers). Existing methods require a parameter standing for the "level of outlyingness", such as the maximum size or a percentage of small clusters, in order to build the set of outliers. Unfortunately, manually setting this parameter in a streaming environment is impractical, given the fast response times usually required. In this paper we propose WOD, a method that separates outliers from clusters thanks to a natural and effective principle. The main advantages of WOD are its ability to automatically adjust to any clustering result and to be parameterless.

    Discovering Frequent Behaviors: Time is an Essential Element of the Context

    International audience. One of the most popular problems in usage mining is the discovery of frequent behaviors. It relies on the extraction of frequent itemsets from usage databases. However, those databases are usually considered as a whole and therefore, itemsets are extracted over the entire set of records. Our claim is that possible subsets, hidden within the structure of the data and containing relevant itemsets, may exist. These subsets, as well as the itemsets they contain, depend on the context. Time is an essential element of the context. The users' intents will differ from one period to another. Behaviors over Christmas will be different from those extracted during the summer. Unfortunately, these periods might be lost because of arbitrary divisions of the data. The goal of our work is to find itemsets that are frequent over a specific period but would not be extracted by traditional methods since their support is very low over the whole dataset. We introduce the definition of solid itemsets, which represent coherent and compact behaviors over specific periods, and we propose SIM, an algorithm for their extraction.

    Découverte de motifs d'évolution significatifs dans les séries temporelles d'images satellites [Discovering Significant Evolution Patterns in Satellite Image Time Series]

    International audience. Satellite Image Time Series (SITS) are important sources of information on how land evolves. Studying these images makes it possible to understand changes over specific areas and also to discover large-scale evolution patterns. However, discovering these phenomena requires addressing several challenges tied to the characteristics of SITS and their constraints. First, each pixel of a satellite image is described by several values (the radiometric levels at different wavelengths). Second, these evolution patterns span very long periods and are not necessarily synchronous across regions. Third, regions unaffected by significant change are in the majority, and their dominance makes evolution patterns hard to extract. In this paper, we propose a method that addresses these difficulties and we validate it on a series of satellite images acquired over a 20-year period.

    Web Usage Mining : extraction de périodes denses à partir des logs [Web Usage Mining: Extracting Dense Periods from Logs]

    National audience. Existing Web Usage Mining techniques currently rely on a division of the data that is either arbitrary (e.g. "one log per month") or guided by expected results (e.g. "what are customer behaviors during the Christmas shopping period?"). These approaches suffer from two problems. First, they depend on this arbitrary organization of the data over time. Second, they cannot automatically extract "seasonal peaks" from the stored data. We propose to exploit the data to automatically discover "dense" periods of behavior. A period is considered "dense" if it contains at least one sequential pattern that is frequent among all the users who were connected to the site during that period.

    Maximally Informative k-Itemset Mining from Massively Distributed Data Streams

    International audience. We address the problem of mining maximally informative k-itemsets (miki) in data streams based on joint entropy. We propose PentroS, a highly scalable parallel miki mining algorithm. PentroS renders the mining process of large volumes of incoming data very efficient. It is designed to take into account the continuous aspect of data streams, particularly by reducing the computations needed to update the miki results after the arrival/departure of transactions to/from the sliding window. PentroS has been extensively evaluated using massive real-world data streams. Our experimental results confirm the effectiveness of our proposal, which allows excellent throughput with high itemset length.
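As a rough illustration of the joint-entropy criterion mentioned above, the sketch below scores an itemset by the entropy of its presence/absence patterns across the transactions of a window and picks the best k-itemset by exhaustive search. It is not PentroS: it is neither parallel nor incremental, and the function names and the toy `window` are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(transactions, itemset):
    """Entropy (in bits) of the presence/absence patterns of `itemset`."""
    items = sorted(itemset)
    n = len(transactions)
    # Each transaction maps to a tuple of booleans, one per item.
    patterns = Counter(tuple(i in t for i in items) for t in transactions)
    return -sum(c / n * log2(c / n) for c in patterns.values())

def brute_force_miki(transactions, k):
    """Return a k-itemset with maximal joint entropy (exhaustive search)."""
    universe = sorted(set().union(*transactions))
    return max(combinations(universe, k),
               key=lambda s: joint_entropy(transactions, s))

# Hypothetical window content: "c" is in every transaction, so it carries
# no information, while "a" and "b" split the window evenly.
window = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}]

print(joint_entropy(window, {"a", "b"}))  # 2.0
print(joint_entropy(window, {"a", "c"}))  # 1.0
print(brute_force_miki(window, 2))        # ('a', 'b')
```

The exhaustive search enumerates every k-combination of items, which is what makes the problem expensive on large streams and motivates scalable approaches.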

    A Distributed Collaborative Filtering Algorithm Using Multiple Data Sources

    International audience. Collaborative Filtering (CF) is one of the most commonly used recommendation methods. CF consists in predicting whether, or how much, a user will like (or dislike) an item by leveraging the knowledge of the user's preferences as well as that of other users. In practice, users interact and express their opinion on only a small subset of items, which makes the corresponding user-item rating matrix very sparse. Such data sparsity yields two main problems for recommender systems: (1) the lack of data to effectively model users' preferences, and (2) the lack of data to effectively model item characteristics. However, there are often many other data sources that are available to a recommender system provider, which can describe user interests and item characteristics (e.g., users' social network, tags associated with items, etc.). These valuable data sources may supply useful information to enhance a recommendation system in modeling users' preferences and item characteristics more accurately and thus, hopefully, to make recommenders more precise. For various reasons, these data sources may be managed by clusters of different data centers, thus requiring the development of distributed solutions. In this paper, we propose a new distributed collaborative filtering algorithm, which exploits and combines multiple and diverse data sources to improve recommendation quality. Our experimental evaluation using real datasets shows the effectiveness of our algorithm compared to state-of-the-art recommendation algorithms.

    When sharing computer science with everyone also helps avoiding digital prejudices: Escape computer dirty magic: learn Scratch !

    International audience. We computer scientists have to increase human knowledge, e.g. help people better understand what mechanical intelligence is. We further have to contribute to producing something good from it. But we also have the duty to share this knowledge with everyone, to be sure that no one suffers while everyone benefits from the derived technology. We observe that the main challenge is to free people from preconceptions, misconceptions and prejudices about computers and robots, and that the main enemy is “belief in artificial intelligence phantasms”! To escape from such deleterious magic, creative computing with Scratch is our best friend, unplugged activities our best lever and playful robotics our best partner. Let us tell it as a story.

    Discovering Tight Space-Time Sequences

    International audience. The problem of discovering spatiotemporal sequential patterns affects a broad range of applications. Many initiatives find sequences constrained by space and time. This paper addresses an appealing new challenge for this domain: find tight space-time sequences, i.e., find within the same process: i) frequent sequences constrained in space and time that may not be frequent in the entire dataset and ii) the time interval and space range where these sequences are frequent. The discovery of such patterns along with their constraints may lead to the extraction of valuable knowledge that can remain hidden using traditional methods since their support is extremely low over the entire dataset. We introduce a new Spatio-Temporal Sequence Miner (STSM) algorithm to discover tight space-time sequences. We evaluate STSM using a proof of concept use case. When compared with general spatial-time sequence mining algorithms (GSTSM), STSM allows for new insights by detecting maximal space-time areas where each pattern is frequent. To the best of our knowledge, this is the first solution to tackle the problem of identifying tight space-time sequences.